Although NumPy's core is implemented in C, certain compute-intensive algorithms still hit a vectorization wall: the overhead of Python's dynamic dispatch outweighs the benefits of the high-level abstraction.
1. The Interpreter Tax & Boxing
Every iteration of a standard Python loop involves dynamic type-checking and reference counting. Even with NumPy scalars, the "boxing" of raw C data into Python objects creates a severe bottleneck for functions like $\text{logit}(p) = \log(p/(1-p))$. A C implementation handles the edge cases at native speed:
>>> logit(0)
-inf
>>> logit(1)
inf
>>> logit(2)
nan
>>> logit(-2)
nan
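To make the contrast concrete, here is a sketch (function names are my own) of the boxed per-element loop versus a single vectorized expression; the vectorized form reproduces the edge cases above because they fall out of IEEE-754 semantics, provided the floating-point warnings are suppressed:

```python
import math
import numpy as np

def logit_boxed(p_values):
    """Per-element loop: every p is boxed into a Python float object,
    paying the interpreter tax on each iteration."""
    out = []
    for p in p_values:
        if p < 0 or p > 1:
            out.append(math.nan)
        elif p == 0:
            out.append(-math.inf)
        elif p == 1:
            out.append(math.inf)
        else:
            out.append(math.log(p / (1 - p)))
    return out

def logit_vectorized(p):
    """One NumPy expression; -inf/inf/nan emerge from IEEE-754 rules."""
    p = np.asarray(p, dtype=np.float64)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(p / (1 - p))

p = [0.0, 1.0, 2.0, -2.0, 0.5]
print(logit_boxed(p))
print(logit_vectorized(p))
```

Even the vectorized call still allocates temporaries and re-checks dtypes per operation, which is what motivates dropping to C in the first place.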
2. Intermediate Array Bloat
Pure NumPy expressions allocate a temporary buffer for each sub-operation. Extending via the C-API allows for kernel fusion, where the logit transform is computed in a single pass without auxiliary memory overhead.
3. Spatial Dependencies
Operations involving neighbor-access patterns, such as the 2D stencil:
$$B_{i,j} = A_{i,j} + 0.5\,(A_{i-1,j} + A_{i+1,j} + A_{i,j-1} + A_{i,j+1}) + 0.25\,(A_{i-1,j-1} + A_{i-1,j+1} + A_{i+1,j-1} + A_{i+1,j+1})$$
are difficult to express efficiently via slicing without redundant temporary arrays. C extensions permit direct, cache-friendly pointer arithmetic over the underlying buffer.
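For reference, the slicing formulation of the stencil above looks like this in pure NumPy; the slices themselves are views, but every `+` materializes a fresh interior-sized temporary, which is precisely the overhead a hand-written C loop avoids:

```python
import numpy as np

def stencil_sliced(A):
    """Apply the 9-point stencil to the interior of A via slicing.

    Each binary operation below allocates a temporary array the size
    of the interior; a C kernel would compute B[i, j] in one pass.
    """
    B = A.copy()  # boundary rows/columns are left unchanged
    B[1:-1, 1:-1] = (
        A[1:-1, 1:-1]
        + 0.5  * (A[:-2, 1:-1] + A[2:, 1:-1] + A[1:-1, :-2] + A[1:-1, 2:])
        + 0.25 * (A[:-2, :-2] + A[:-2, 2:] + A[2:, :-2] + A[2:, 2:])
    )
    return B

A = np.arange(25, dtype=np.float64).reshape(5, 5)
B = stencil_sliced(A)
```

Eight shifted views plus the scaled sums mean roughly a dozen allocations per call; the C version needs none.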